library(tidyverse)
library(rpart)
library(rpart.plot)
avocado <- read_csv("data/avocado.csv")
Missing column names filled in: 'X1' [1]
Parsed with column specification:
cols(
  X1 = col_double(),
  Date = col_date(format = ""),
  AveragePrice = col_double(),
  `Total Volume` = col_double(),
  `4046` = col_double(),
  `4225` = col_double(),
  `4770` = col_double(),
  `Total Bags` = col_double(),
  `Small Bags` = col_double(),
  `Large Bags` = col_double(),
  `XLarge Bags` = col_double(),
  type = col_character(),
  year = col_double(),
  region = col_character()
)
head(avocado)
summary(avocado)
X1 Date AveragePrice Total Volume
Min. : 0.00 Min. :2015-01-04 Min. :0.440 Min. : 85
1st Qu.:10.00 1st Qu.:2015-10-25 1st Qu.:1.100 1st Qu.: 10839
Median :24.00 Median :2016-08-14 Median :1.370 Median : 107377
Mean :24.23 Mean :2016-08-13 Mean :1.406 Mean : 850644
3rd Qu.:38.00 3rd Qu.:2017-06-04 3rd Qu.:1.660 3rd Qu.: 432962
Max. :52.00 Max. :2018-03-25 Max. :3.250 Max. :62505647
4046 4225 4770 Total Bags
Min. : 0 Min. : 0 Min. : 0 Min. : 0
1st Qu.: 854 1st Qu.: 3009 1st Qu.: 0 1st Qu.: 5089
Median : 8645 Median : 29061 Median : 185 Median : 39744
Mean : 293008 Mean : 295155 Mean : 22840 Mean : 239639
3rd Qu.: 111020 3rd Qu.: 150207 3rd Qu.: 6243 3rd Qu.: 110783
Max. :22743616 Max. :20470573 Max. :2546439 Max. :19373134
Small Bags Large Bags XLarge Bags type
Min. : 0 Min. : 0 Min. : 0.0 Length:18249
1st Qu.: 2849 1st Qu.: 127 1st Qu.: 0.0 Class :character
Median : 26363 Median : 2648 Median : 0.0 Mode :character
Mean : 182195 Mean : 54338 Mean : 3106.4
3rd Qu.: 83338 3rd Qu.: 22029 3rd Qu.: 132.5
Max. :13384587 Max. :5719097 Max. :551693.7
year region
Min. :2015 Length:18249
1st Qu.:2015 Class :character
Median :2016 Mode :character
Mean :2016
3rd Qu.:2017
Max. :2018
library(janitor)
Attaching package: ‘janitor’
The following objects are masked from ‘package:stats’:
chisq.test, fisher.test
clean_avocado <- clean_names(avocado)
clean_avocado
clean_avocado <- clean_avocado %>%
select(-c(x1, date))
clean_avocado
n_data <- nrow(clean_avocado)
test_index <- sample(1:n_data, size = n_data*0.2)
avocado_test <- slice(clean_avocado, test_index)
avocado_train <- slice(clean_avocado, -test_index)
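Because `sample()` is random, the split above changes every time the document is knitted; calling `set.seed()` first makes it reproducible. A minimal base-R sketch on a toy data frame (the `toy_df` name is illustrative, not from the homework data):

```r
# Reproducible 80/20 train/test split, sketched on a toy data frame
set.seed(42)                        # fix the RNG so the split is repeatable
toy_df   <- data.frame(id = 1:100, value = rnorm(100))
n        <- nrow(toy_df)
test_idx <- sample(1:n, size = n * 0.2)
toy_test  <- toy_df[test_idx, ]     # 20% held out for testing
toy_train <- toy_df[-test_idx, ]    # remaining 80% for training
nrow(toy_test)   # 20
nrow(toy_train)  # 80
```

The same `set.seed()` call placed before the `sample()` above would make the avocado split repeatable too.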
avocado_test %>%
tabyl(type)
type n percent
conventional 1834 0.5026035
organic 1815 0.4973965
avocado_train %>%
tabyl(type)
type n percent
conventional 7292 0.4994521
organic 7308 0.5005479
avocado_fit <- rpart(type ~ .,
data = avocado_train,
method = 'class')
rpart.plot(avocado_fit, yesno = 2)
rpart.rules(avocado_fit, cover = TRUE)
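The tree above is grown with rpart's defaults. The complexity parameter (`cp`) controls how far the tree is grown, and `printcp()`/`prune()` let you inspect cross-validated error and trim the tree back. A hedged sketch on the built-in `iris` data (not the avocado data):

```r
library(rpart)

# Fit a classification tree on the built-in iris data
iris_fit <- rpart(Species ~ ., data = iris, method = "class")

# The cp table reports cross-validated error at each candidate tree size
printcp(iris_fit)

# Prune back to the cp with the lowest cross-validated error
best_cp <- iris_fit$cptable[which.min(iris_fit$cptable[, "xerror"]), "CP"]
iris_pruned <- prune(iris_fit, cp = best_cp)

# In-sample accuracy of the pruned tree
pred <- predict(iris_pruned, iris, type = "class")
mean(pred == iris$Species)
```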
library(modelr)
avocado_test_pred <- avocado_test %>%
add_predictions(avocado_fit, type = 'class')
avocado_test_pred %>%
select(type, pred)
rpart.predict(avocado_fit, newdata=avocado_test[1:3,], rules=TRUE)
Error in `$<-.data.frame`(`*tmp*`, "rule", value = character(0)) :
replacement has 0 rows, data has 3
library(yardstick)
For binary classification, the first factor level is assumed to be the event.
Set the global option `yardstick.event_first` to `FALSE` to change this.
Attaching package: ‘yardstick’
The following objects are masked from ‘package:modelr’:
mae, mape, rmse
The following object is masked from ‘package:readr’:
spec
conf_mat <- avocado_test_pred %>%
conf_mat(truth = type, estimate = pred)
Error: `type` should be a factor.
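The error says `conf_mat()` needs factors, but `read_csv()` imported `type` as character (and `add_predictions()` returns `pred` accordingly). Converting both columns to factors before calling `conf_mat()` fixes it, e.g. `mutate(type = as.factor(type))` in the pipeline above. The underlying idea can be sketched in base R with toy vectors (names illustrative):

```r
# conf_mat() requires factors; read_csv() gives characters, so convert first.
# Sketched here with toy truth/prediction vectors in base R.
truth    <- factor(c("organic", "conventional", "organic", "organic"))
estimate <- factor(c("organic", "organic", "organic", "conventional"),
                   levels = levels(truth))

# A confusion matrix is just a cross-tabulation of prediction vs truth
cm <- table(Prediction = estimate, Truth = truth)
cm

# Accuracy is the proportion on the diagonal
sum(diag(cm)) / sum(cm)   # 0.5
```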
Clustering homework
computers <- read_csv("data/computers.csv") %>%
clean_names()
Missing column names filled in: 'X1' [1]
Parsed with column specification:
cols(
  X1 = col_double(),
  price = col_double(),
  speed = col_double(),
  hd = col_double(),
  ram = col_double(),
  screen = col_double(),
  cd = col_character(),
  multi = col_character(),
  premium = col_character(),
  ads = col_double(),
  trend = col_double()
)
computers
computers <- computers %>%
select(hd, ram)
computers
computer_scale <- computers %>%
mutate_all(scale)
computer_scale
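`scale()` centres each column to mean 0 and rescales it to standard deviation 1, which puts `hd` and `ram` on a comparable footing before k-means. One quirk worth knowing: it returns a one-column matrix, which is why the scaled tibble columns print as `<dbl[,1]>`. A small base-R sketch:

```r
# scale() centres to mean 0 and rescales to sd 1
x <- c(10, 20, 30, 40, 50)
z <- scale(x)

mean(z)  # effectively 0
sd(z)    # 1

# scale() returns a matrix; drop that structure for a plain numeric vector
z_vec <- as.numeric(scale(x))
```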
ggplot(computer_scale, aes(hd, ram)) +
geom_point()
It looks like there are definite clusters, maybe six.
clustered_computers <- kmeans(computer_scale, centers = 6, nstart = 25)
clustered_computers
K-means clustering with 6 clusters of sizes 386, 745, 1633, 821, 2358, 316
Cluster means:
hd ram
1 1.9359351 0.56184955
2 -0.4850999 -0.05095751
3 0.2314714 -0.16449026
4 0.4603185 1.36972433
5 -0.8238806 -0.82064414
6 2.5345722 2.84885195
Clustering vector:
   [1] 5 5 5 2 4 4 5 5 2 5 2 2 5 2 2 5 5 5 5 3 5 5 4 5 3 5 5 3 4 2 5 5 2 5 5 5 2 3 2 2 2 3 2 2 4 2 5
   ...
 [ reached getOption("max.print") -- omitted 5259 entries ]
Within cluster sum of squares by cluster:
[1] 325.01554 36.02784 184.16775 168.14783 238.32650 189.37178
(between_SS / total_SS = 90.9 %)
Available components:
[1] "cluster" "centers" "totss" "withinss" "tot.withinss" "betweenss"
[7] "size" "iter" "ifault"
library(broom)
Attaching package: ‘broom’
The following object is masked from ‘package:modelr’:
bootstrap
tidy(clustered_computers,
col.names = colnames(computer_scale))
glance(clustered_computers)
augment(clustered_computers, computers)
library(animation)
computer_scale %>%
kmeans.ani(centers = 6)
max_k <- 20
k_clusters <- tibble(k = 1:max_k) %>%
mutate(
kclust = map(k, ~ kmeans(computer_scale, .x, nstart = 25)),
tidied = map(kclust, tidy),
glanced = map(kclust, glance),
augmented = map(kclust, augment, computers)
)
k_clusters
clusterings <- k_clusters %>%
unnest(glanced)
clusterings
ggplot(clusterings, aes(x=k, y=tot.withinss)) +
geom_point() +
geom_line() +
scale_x_continuous(breaks = seq(1, 20, by = 1))
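The elbow plot above is built with the purrr/broom pipeline; the same total within-cluster sum of squares curve can be sketched in base R on a toy dataset (the `toy` name and simulated clusters are illustrative):

```r
set.seed(1)
# Two obvious clusters of 2-d points, centred at 0 and 5
toy <- rbind(matrix(rnorm(100, mean = 0), ncol = 2),
             matrix(rnorm(100, mean = 5), ncol = 2))

# Total within-cluster SS for k = 1..6
wss <- sapply(1:6, function(k) kmeans(toy, centers = k, nstart = 25)$tot.withinss)

plot(1:6, wss, type = "b", xlab = "k", ylab = "tot.withinss")
# The curve drops steeply to k = 2 and then flattens: the elbow is at 2
```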
Looks like K might be 3.
library(factoextra)
package ‘factoextra’ was built under R version 3.6.2
Welcome! Want to learn more? See two factoextra-related books at https://goo.gl/ve3WBa
fviz_nbclust(computer_scale, kmeans, method = "wss", nstart = 25)
fviz_nbclust(computer_scale, kmeans, method = "silhouette", nstart = 25)
3 is still looking good, and so are 7, 9 and 10.
fviz_nbclust(computer_scale, kmeans, method = "gap_stat", nstart = 25, k.max = 10)
Clustering k = 1,2,..., K.max (= 10): .. done
Bootstrapping, b = 1,2,..., B (= 100) [one "." per sample]:
.................................................. 50
.................................................. 100
(repeated warnings: "did not converge in 10 iterations" and
 "Quick-TRANSfer stage steps exceeded maximum (= 312950)")
The gap statistic suggests 7 is the winner, or possibly 3.
clusterings %>%
unnest(cols = c(augmented)) %>%
filter(k <= 10) %>%
ggplot(aes(x = hd, y = ram)) +
geom_point(aes(color = .cluster)) +
facet_wrap(~ k)
In my opinion, k = 3 is the winner.
clusterings %>%
unnest(cols = c(augmented)) %>%
filter(k == 3) %>%
ggplot(aes(x = hd, y = ram, colour = .cluster)) +
geom_point(aes(color = .cluster))
I think the clustering worked well. The third cluster is a bit “out there”, but it’s still an accurate representation.
clusterings %>%
unnest(augmented) %>%
filter(k == 2) %>%
group_by(.cluster) %>%
summarise(mean(hd), mean(ram))
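The `group_by()`/`summarise()` step above profiles each cluster by its average `hd` and `ram`. The same per-cluster means can be sketched in base R with `aggregate()` on a toy clustering (simulated data, names illustrative):

```r
set.seed(2)
# Two simulated groups of machines: small hd/ram vs large hd/ram
toy <- data.frame(hd  = c(rnorm(50, 100), rnorm(50, 1000)),
                  ram = c(rnorm(50, 4),   rnorm(50, 16)))

# Cluster on the scaled data, as in the homework
km <- kmeans(scale(toy), centers = 2, nstart = 25)

# Mean of each original (unscaled) variable within each cluster
aggregate(toy, by = list(cluster = km$cluster), FUN = mean)
```

Summarising on the original scale, as both snippets do, makes the cluster profiles easier to interpret than the standardised values.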